Abstract

Fast  and  efficient  storage,  browsing,  indexing,  and  retrieval  of  video  is  necessary  for  the 
development  of  various  multimedia  database  applications.  Given  that  video  is  typically 
stored  efficiently  in  a  compressed  format,  if  we  can  analyze  the  compressed  representation 
directly,  we  can  avoid  the  costly  overhead  of  decompressing  and  operating  at  the  pixel 
level.  Compressed  domain  parsing  of  video  has  been  presented  in  earlier  work  where 
key frames are identified for shots, subshots, and scenes. In this paper, we describe key
frame selection, feature extraction, indexing, and retrieval techniques that are directly
applicable to MPEG-compressed video. We develop a frame-type-independent
representation of the various types of frames present in an MPEG video in which all frames can
be  considered  equivalent.  Features  are  derived  from  the  available  DCT,  macroblock,  and 
motion  vector  information  and  mapped  to  a  low-dimensional  space  where  they  can  be 
accessed using standard database techniques. The spatial information is used as the primary
index, while the temporal information is used to enhance the robustness of the system
during  the  retrieval  process.  The  techniques  presented  enable  fast  archiving,  indexing, 
and  retrieval  of  video.  Our  operational  prototype  typically  takes  a  fraction  of  a  second 
to retrieve similar video scenes from our database, with over 95% success.

1 Introduction


With the development of various multimedia compression standards and significant
increases  in  desktop  computer  performance  and  storage,  the  widespread  exchange  of 
multimedia  information  is  becoming  a  reality.  Video  is  arguably  the  most  popular 
means  of  communication  and  entertainment.  With  this  popularity  comes  an  increase 
in  the  volume  of  video  and  an  increased  need  for  the  ability  to  automatically  sift 
through  and  search  for  relevant  material  stored  in  large  video  databases  (LVDBs). 
Even  with  increases  in  hardware  capabilities  which  make  video  distribution  possible, 
factors  such  as  algorithm  speed  and  storage  costs  are  concerns  that  must  still  be 
addressed. 

With this in mind, a first consideration is to increase speed when using existing
compression standards. Performing analysis in the compressed domain reduces the
amount of effort involved in decompression, and providing a means of abstracting the
data keeps the storage costs of the resulting feature set low. Both of these problems
are active areas of research.

A second consideration is that a user who is interested in searching for and retrieving
video clips needs a way to interface with the database by formulating appropriate
queries. These queries need to be translated into a form that can be
used to search an index and retrieve the matching clips. A typical approach to indexing
and archiving video for retrieval requires parsing the video, extracting key
information from each clip (possibly a single frame), indexing the information, and
providing a representation which allows accurate and efficient retrieval based on the
user’s request.

Traditional query-by-content algorithms operate on the principle that a query can
be formulated which accurately describes features that can be extracted automatically
by the system, such as color, texture, and shape. In the case of video, this approach
must be augmented to deal with the additional temporal and spatio-temporal dimensions.

In our system, we address these issues by providing algorithms which perform
these tasks on compressed video. In particular, we consider the problem of extracting
indexable features from compressed MPEG frames, indexing the features for each
clip, and providing efficient query capabilities. We present techniques which provide
a framework within which all types of MPEG frames can be considered equivalent.

In  the  next  section,  we  provide  some  background  on  compressed  domain  video 
analysis,  including  a  brief  description  of  MPEG  compression,  and  an  overview  of 
related  work. 

2  Background 

The  analysis  of  compressed  video  can  proceed  in  one  of  two  fundamental  ways.  The 
first is by decompressing some or all of the video and using the individual frames
to  gather  information  about  various  characteristics  of  the  video  such  as  content  or 
motion,  and  extracting  indexable  features  in  the  pixel  domain.  The  second  involves 
exploiting  encoded  information  contained  in  the  compressed  representation  without 
incurring  the  overhead  of  complete  decompression. 

The  problem  of  video  retrieval  arises  when  a  user  or  an  application  poses  queries 
to  a  large  database  of  video  clips  in  some  format,  and  a  fast,  efficient,  and  precise 
reply  is  required.  If  the  query  is  an  image  or  another  video  and  the  user  requests 
that  the  system  retrieve  similar  clips,  it  is  called  a  “query-by-example”.  In  this  case, 
the  main  challenge  in  comparing  clips  or  frames  of  a  clip  is  providing  a  suitable 
definition of what it means for two clips to be similar. In the pixel domain, color-based
similarity can be implemented using, for example, features extracted from
color histograms, or a pixel-by-pixel comparison, though the latter is computationally
expensive. Other methods of specifying queries may involve the user sketching the
shapes of interest, or providing textual queries to access annotations
that were derived automatically or entered manually. Other features that can be
used to define similarity, and which have proven useful in related domains, include image
texture, object shape, and spatial relations between objects.

With  an  increasing  amount  of  video  available,  and  given  the  query  challenges 
stated  above,  automated  techniques  for  searching  large  video  databases  in  a  fast, 
efficient  manner  are  necessary.  Since  decompressing  video  is  very  time  consuming,  we 
explore  techniques  for  analyzing  video  using  the  information  available  in  an  MPEG- 
compressed  video  stream. 

2.1  MPEG  stream 

The Moving Picture Experts Group (MPEG) standard for digital video is arguably
the most widely accepted international video compression standard. The MPEG
encoding algorithm [12] relies on two basic techniques: block-based motion compensation
to capture temporal redundancy, and transform-domain-based compression to
capture spatial redundancy. Motion compensation is applied using both
predictive and interpolative techniques. The prediction error signal is further compressed
using spatial redundancy reduction techniques. The fact that temporal and
spatial changes are fundamental for segmentation makes MPEG an ideal candidate
for compressed domain analysis.

An MPEG stream [11] consists of three types of frames (I, P, and B frames)
occurring in a repetitive pattern (called the IPB pattern). An I frame is an anchor
frame that is simply a JPEG encoding [15] of its corresponding pixel image. A P
frame is predicted from its preceding I or P frame, and a B frame is predicted from
both its preceding and following I and/or P frames. I and P frames are collectively
called reference frames since only these two types of frames are used during prediction
and interpolation of other frames. Figure 1 shows an example of the predictive
relationships. For most clips, such as the example shown in Figure 1, the IPB pattern
is regular, with the number of B frames between reference frames and the number
of P frames between I frames being constant. Clips with irregular IPB patterns do
not pose a problem to our system; in later sections we describe techniques to handle
such cases. All three types of frames are ultimately encoded using the 2D Discrete
Cosine Transform (DCT), quantization, and run-length coding. In the I frames, the
DCT information is derived directly from the image samples, but in the P and B
frames, the DCT information is derived from the residual error after prediction. The
motion compensation information, represented as vectors, is also differentially coded.
We use the term ‘motion vector’ to refer to the block-based motion compensation
vector. Each frame in the clip is divided into blocks of 16x16 pixels called macroblocks
(MBs), and most of the low-level processing, including spatial and temporal
redundancy reduction, is performed at this level of spatial resolution.

During  the  encoding  process,  a  procedure  is  run  on  each  macroblock  in  a  P  frame, 
to  see  if  that  macroblock  can  be  predicted  from  its  corresponding  block  in  the  previous 
I/P frame with a possible offset to compensate for motion. If it appears that little
would  be  gained  in  predicting  and  encoding,  then  that  block  is  not  predicted  but  is 
intra-coded.  This  typically  occurs  if  the  current  macroblock  does  not  have  much  in 
common  with  the  previous  one.  Every  P  frame  therefore  consists  of  intra-coded  MBs 
and  forward-predicted  MBs.  There  are  also  skipped  MBs  which  resemble  forward- 
predicted  MBs,  since  they  are  identical  to  some  16x16  block  of  the  previous  reference 
frame. 

Similarly, for each macroblock in a B frame, a test is made to see if it can be
predicted from both its previous and next I/P frames, its previous I/P frame alone,
its next I/P frame alone, or if it cannot be predicted from any reference frame and
thus must be intra-coded. An MB can also be skipped if it is identical to some part of
the reference frames and the residual error becomes zero. Each skipped MB behaves
identically to the previous non-skipped MB in the current slice of the current frame.^
In other words, the first MB of a slice cannot be skipped, and if the MB type and
motion vectors of this non-skipped MB are sufficient for any continuous string of MBs
in the slice immediately following the non-skipped MB, those MBs are skipped. Each
MB in a B frame can be of any one of these five types.

^MBs in a single frame are grouped into slices of MBs, as a means of layering.

I B B P B B I

Figure 1: Predictive relationships between I, P and B frames

The  MPEG  standard  does  not  specify  which  techniques  are  to  be  used  for  motion 
estimation  and  prediction  during  the  encoding  process.  Thus,  little  can  be  assumed 
about  the  quality  of  the  prediction  obtained  while  encoding.  Nevertheless,  we  assume 
that  the  prediction  techniques  are  reasonable  enough  to  yield  reliable  motion  vector 
and macroblock data. The reader should refer to two articles by Le Gall [11, 12] for
more information on MPEG.

The computationally expensive step in decompressing MPEG video is the inverse
DCT (IDCT), which should be avoided if possible. The information available in the
compressed representation without performing the IDCT includes the type of each MB,
the DCT coefficients of each MB, and the motion vector components for the forward,
backward, and bidirectionally predicted MBs. The approach followed in this paper
involves utilizing all three: the MB types, DCT coefficients, and motion vectors.

2.2 Related work


The video indexing and retrieval problem has been addressed by researchers in a
number of ways. A survey of digital video parsing and indexing technologies, mainly
in the pixel domain, is presented in the paper by Ahanger and Little [1], including a
discussion of research trends in video indexing and requirements of future data delivery
systems. Topics such as video data indexing, video data modeling, information
extraction, and video scene segmentation are also presented.

In other work, Zhang et al. [17] describe techniques for use in the pixel domain
for dealing with the representation of shot content, as well as content-based retrieval
techniques using key frames and temporal properties of shots. They also present
techniques for video parsing in the pixel domain followed by key frame extraction.
The representation of shot content is based on several types of features, including
color histogram and moment features, texture features, shape features, and edge
features. Similar work can be found in [13] by Nagasaka and Tanaka, who present
pixel-domain techniques for performing full-video search for specified objects using
features derived locally. Two recent papers by Flickner et al. [7] describe the QBIC
system, which performs content-based retrieval based on color, shape, texture, and
sketches in large image and video databases. Ardizzone et al. [2, 3] also deal with
content-based video indexing based on motion, color, texture, and other global
features. All the aforementioned papers present techniques in the pixel domain, but
much less has been done in the compressed domain.

Idris and Panchanathan [8] propose an algorithm based on vector quantization
(VQ) for indexing of video sequences in compressed form. During compression, the
image is decomposed into vectors and mapped to a finite set of codewords and encoded
using adaptive VQ. Each frame is represented by a set of labels and a codebook which
are used to generate indices. A generic paper on compressed-domain video indexing
techniques by Chang [5] describes some of the issues involved in addressing such a
problem in the DCT, wavelet, and subband transform compressed domains.




2.3  Approach 

Our approach to compressed domain indexing and retrieval can be split into three
parts: segmentation, indexing, and query processing. First, video segmentation
divides the incoming video into shots or scenes, and selects one or more key or
representative frames for each shot. A shot in a video clip is defined as a maximal
sequence of frames resulting from a continuous uninterrupted recording of video data.
These shots may be further subdivided into scenes if, for example, significant camera
motion^ is present. This step has been presented in our earlier work on video
segmentation [9, 10] and is described briefly in Section 3, along with the key frame
identification procedure.

Second,  features  are  extracted  from  the  key  frames  supplied  by  the  segmentation 
process  and  used  to  create  a  database  index  (Section  4).  The  features  used  are  derived 
from  the  DCT  coefficients  and  the  motion  vector  information  available  in  the  MPEG 
compressed  video.  Unfortunately,  the  dimensionality  of  each  feature  vector  prohibits 
the  implementation  of  standard  database  retrieval  techniques,  not  to  mention  the 
tremendous  overhead  of  storage  per  frame.  Using  a  technique  called  FastMap  [6],  the 
dimensionality  of  the  features  can  be  reduced  to  a  manageable  level  where  they  can 
be  represented  using  standard  database  techniques  (Section  5). 
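
As a rough illustration of the dimensionality reduction step, the sketch below follows the commonly described form of FastMap [6]: pick two distant "pivot" objects, project every object onto the line through them using the cosine law, and repeat on the residual distances. It is only a sketch under stated assumptions. The function name `fastmap`, the pivot heuristic, and the use of plain Python lists are illustrative choices, and the version used in our system (Section 5) may differ in its details.

```python
import math

def fastmap(objects, dist, k):
    """Map each object to a k-dimensional point using only pairwise distances.

    `dist` is any distance function on two objects (an assumption of this sketch).
    """
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def residual_sq(i, j, axis):
        # Squared distance left after removing the first `axis` coordinates.
        d2 = dist(objects[i], objects[j]) ** 2
        for a in range(axis):
            d2 -= (coords[i][a] - coords[j][a]) ** 2
        return max(d2, 0.0)

    for axis in range(k):
        # Pivot heuristic: start anywhere, then jump to the farthest object twice.
        a = 0
        b = max(range(n), key=lambda j: residual_sq(a, j, axis))
        a = max(range(n), key=lambda j: residual_sq(b, j, axis))
        d_ab2 = residual_sq(a, b, axis)
        if d_ab2 == 0.0:
            break  # nothing left to separate; remaining coordinates stay 0
        d_ab = math.sqrt(d_ab2)
        for i in range(n):
            # Cosine-law projection of object i onto the line through the pivots.
            coords[i][axis] = (
                residual_sq(a, i, axis) + d_ab2 - residual_sq(b, i, axis)
            ) / (2.0 * d_ab)
    return coords
```

In practice, `objects` would be the high-dimensional feature vectors of the key frames and `k` a small number of target dimensions; both are left to the caller here.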

Finally, when the need arises, the database is accessed using the features derived
from a query clip as an index. When a query arrives, segmentation and key frame
extraction are performed, if necessary, and features are extracted from the key frames
of  the  query.  These  features  are  then  used  to  index  into  the  database  to  perform 
retrieval.  By  mapping  the  query  to  the  same  space  as  the  stored  key  frames,  standard 
similarity  metrics  such  as  Euclidean  distance  between  a  query  frame  and  frames  in 
the  database  can  be  used  to  retrieve  the  best  matches.  Experiments  and  results  of 
query  processing  are  discussed  in  Section  6. 
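
Once the query and the stored key frames live in the same low-dimensional space, retrieval reduces to a nearest-neighbor search. The following minimal sketch ranks stored key frames by Euclidean distance to a query point. The clip identifiers, the three-dimensional points, and the linear scan are assumptions made for illustration; a real database would typically use a spatial index rather than scanning every entry.

```python
import math

def retrieve(query_point, index, top_k=3):
    """Return the top_k (clip_id, distance) pairs closest to query_point.

    `index` maps a clip identifier to its low-dimensional key frame point.
    """
    scored = [(clip_id, math.dist(query_point, point))
              for clip_id, point in index.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]

# Hypothetical index of three key frames mapped to a 3-D feature space.
index = {
    "news_01": (0.1, 0.9, 0.3),
    "sports_07": (0.8, 0.2, 0.5),
    "news_02": (0.2, 0.8, 0.4),
}
best = retrieve((0.15, 0.85, 0.3), index, top_k=2)
```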

^Throughout this paper, we assume that “camera motion” refers to operations like panning,
tilting, and zooming a stationary camera and not to changes in the position of the camera.


Figure  2:  Flowchart  of  our  approach 

Figure  2  shows  the  flowchart  of  how  the  individual  components  of  the  system 
interact. 

3  Video  segmentation  overview 

Our  approach  to  indexing  involves  extracting  a  set  of  key  frames  for  the  entire  clip, 
such  that  as  much  of  the  content  of  the  video  is  captured  as  possible,  but  at  the  same 
time,  redundant  frames  are  excluded.  Two  key  frames  of  essentially  the  same  content 
that  are  separated  by  other  key  frames  are  not  considered  redundant,  as  the  physical 
and  temporal  structure  of  the  video  needs  to  be  preserved. 

The  first  step  involves  segmenting  the  video  by  identifying  the  frame(s)  where 
a  transition  takes  place  from  one  shot  to  another.  A  change  which  occurs  exactly 
between  two  frames  is  called  a  cut  or  a  break,  whereas  transitions  that  occur  gradually 
over  several  frames  are  called  fades,  dissolves,  wipes,  or  special  effect  edits. 

Our approach to segmentation analyzes the types of MBs that have been used
to encode the P and B frames, and uses the counts of the different types of MBs to
derive a metric that pinpoints where cuts or breaks occur. An analysis of the macroblock
types alone does not always provide sufficient information to indicate that a
shot change occurs between two frames. We have developed a DCT validation procedure
that is used to confirm the existence of shot changes for which the macroblock
information is found to be insufficient.

Figure 3: Diagram showing the shot subdivision concept.


If large camera motion is present in a single shot, then two frames that are spaced
well apart in the shot may be quite dissimilar. Therefore, once video is segmented into
shots, these shots are further segmented into “subshots” based on some attribute that
the frames of the subshot have in common. When shots are subdivided based
on changes in content due to camera motion, these subdivisions are called “scenes”
(Figure 3). Our approach to subdividing shots into scenes involves using the motion
vectors encoded into the MPEG format to determine any type of camera motion that
may be present, including zoom-in, zoom-out, pan left, pan right, tilt up, tilt down,
or a combination of pan and tilt. We use the “flow” information (to be described
in the next section) to derive a unique vector (called the dominant flow vector) that
describes the relative displacement of the contents of the current frame with respect
to the next frame. Starting from the first frame of the shot, by successively adding
these dominant flow vectors, we can determine the displacement of each frame from
the first frame. The magnitude of this total displacement vector gives an estimate of
the magnitude of the perceptual translation caused by the motion of the camera. By
comparing this magnitude with the dimensions of the frame, we determine whether
the particular frame under consideration has undergone enough translation from the
first frame, and if so, the frame is tagged as the possible start of a new scene. If
the camera motion involves a zoom operation, we tag the last frame of the zoom
sequence as the possible start of a new scene. By comparing the DCT information
of the tagged frame with the DCT information of the frame from which the total
displacement vector is calculated, we can determine if the current frame is a true
candidate for the start of a new scene.
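
The accumulation of dominant flow vectors can be sketched as follows. This is a simplified illustration, not the exact procedure: the threshold (`fraction` of the smaller frame dimension) and the decision to restart the accumulation at each tagged frame are assumptions, and both the zoom case and the DCT validation step are omitted.

```python
import math

def tag_scene_starts(dominant_flows, frame_width, frame_height, fraction=0.5):
    """Return indices of frames tagged as possible starts of a new scene.

    dominant_flows[i] is the (dx, dy) displacement from frame i to frame i + 1.
    """
    tagged = []
    total_x = total_y = 0.0
    limit = fraction * min(frame_width, frame_height)  # assumed threshold
    for i, (dx, dy) in enumerate(dominant_flows):
        total_x += dx
        total_y += dy
        if math.hypot(total_x, total_y) >= limit:
            tagged.append(i + 1)      # frame i + 1 has drifted far enough
            total_x = total_y = 0.0   # measure the next scene from this frame
    return tagged

# A steady pan of 40 pixels per frame across a 352x240 clip.
starts = tag_scene_starts([(40, 0)] * 10, 352, 240)
```

With these hypothetical numbers the limit is 120 pixels, so every third frame is tagged as a candidate; each candidate would then be checked against the DCT information as described above.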

The final step in the segmentation process involves identifying a key frame for
each of the subshots or scenes. Different subshots that have been divided based on
camera motion may or may not have similar content, or there may be different content
in the same subshot; such subshots would not be the best candidates for key frame
selection. Choosing key frames of scenes allows us to capture most of the content
variations, due at least to camera motion, while at the same time excluding other key
frames which may be redundant.

The reader can refer to our previous work [9, 10] for more information on segmentation
of video.

3.1  Key  Frame  Identification 

It is important that the choice of key frames be made carefully, since a key frame
will represent an entire shot in all future applications. The key frame candidates
should possess all the requisite information to enable features to be extracted, and
also enable the features to capture as much of the content and other attributes of the
scene as possible.

The ideal method of selecting key frames would be to compare each frame to
every other frame in the scene and select the frame with the least difference from
other frames in terms of a given similarity measure. Obviously, this requires extensive
computation and is not practical for most applications. On the other hand, choosing
the first frame seems to be the natural choice, as all the rest of the frames in the scene
can be considered to be logical and continuous extensions of the first frame, but it
may not be the best match for all the frames in the scene. A third possible choice
is the middle frame of the scene, as it might be expected to have the most similarity
with all the other frames of the scene, although this is not guaranteed. In a more
general framework, we would like to choose frames with the greatest content or index
potential, for example, frames with text, or frames with a clear unoccluded object.
Other factors that influence the choice include encoding patterns. For example, the
frequency of I frames may affect the choice, as I frames represent the best candidates
for key frame selection since they have the actual DCT coefficients which form the
spatial component of the data. If I frames occur fairly frequently, then the first I
frame can be chosen as the key frame. It is, however, possible to have an entire scene
with no I frame, in which case alternate measures are required.
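
To make the cost of the exhaustive comparison mentioned at the start of the paragraph above concrete, the sketch below selects the frame whose summed difference to all other frames is smallest. The per-frame feature tuples and the L1 difference are placeholders chosen for illustration; any similarity measure could be substituted. The nested comparison makes the quadratic growth in the number of frames explicit.

```python
def exhaustive_key_frame(features, difference):
    """Index of the frame with the smallest summed difference to all others."""
    n = len(features)
    totals = [
        sum(difference(features[i], features[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    return min(range(n), key=totals.__getitem__)

def l1_difference(a, b):
    """Placeholder difference measure: sum of absolute component differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical one-dimensional features for a five-frame scene.
scene = [(0,), (4,), (5,), (6,), (20,)]
key = exhaustive_key_frame(scene, l1_difference)
```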

We have found that it is, in general, sufficient to select the first frame of each scene
as a key frame. This is based on the observation that cinematographers attempt to
“characterize” a shot with the first few frames, before beginning to track or zoom to
a close-up. The practical reason for this choice will become clear in the next section,
as we develop techniques to circumvent the problems due to encoding, and generate
a framework where all frames can be considered equivalent.

In  the  final  representation,  the  video  is  partitioned  into  a  set  of  scenes  which 
exhibit  consistency  in  content,  and  each  scene  is  represented  by  a  key  frame. 

4  Feature  Extraction 

A difficulty with identifying key frames in video that has been compressed by a
method like MPEG is that frames can be of different types, i.e., I, P, or B frames,
and can occur in a variety of patterns. An I frame contains DCT coefficients of
actual pixel data, but has no motion vectors, whereas a P or B frame contains DCT
coefficients generated from residual error data after prediction or interpolation from
other reference frame(s), but has motion vectors relating the frame to its reference
frame(s). Different MPEG clips may also have different patterns of I, P, and B
frame orderings. A problem then arises when we try to identify key frames and
subsequent index information. Should we identify only certain types of frames, for
example, I frames, as key frames, or do we desire the flexibility to choose any frame
independently of the frame type? To avoid these problems, we desire a frame-type-independent
representation, in which all the features that we extract for the indexing
and retrieval phase are obtained independently of factors such as frame type.

In our indexing and retrieval phase, we use the DC^ coefficients and the motion
vectors of the key frames. The features that must be extracted from each key frame
include the DC coefficients, which form the spatial component, and the motion vectors
of the MBs, which form the temporal component. The next two subsections explain
the techniques used to extract these features from each type of frame.

4.1  DCT  estimation 

DCT coefficients are readily accessible for I frames, but since P and B frames are
represented by the residual error after prediction or interpolation, their DCT coefficients
need to be estimated. To calculate the DCT coefficients of an MB in a P
frame or B frame, the DCT coefficients of the 16x16 area of the reference frame that
the current MB was predicted from need to be calculated. Let us call this area the
reference MB (though it is not an actual MB). Since the DCT is a linear transform,
the DCT coefficients of the reference MB in the reference frame can be calculated
from the DCT coefficients of the four MBs that can overlap this reference MB, albeit
with substantial computational expense. It is easy, however, to calculate reasonable
approximations to the DC coefficients of an MB of a P or B frame. Techniques for
doing this were suggested by Yeo and Liu [16] and also by Shen and Delp [14].

^Of the 64 DCT coefficients, the coefficient with zero frequency in both dimensions is called the
‘DC coefficient’, while the remaining 63 are called the ‘AC coefficients’.

Figure 4: DC estimation: $MB_{cur}$ is predicted from $MB_{ref}$. The motion vector is (x, y).

Figure 4 shows an MB in a P frame, $MB_{cur}$, being predicted from a 16x16 area
denoted by $MB_{ref}$. While encoding the P frame, only the residual error of $MB_{cur}$
with respect to $MB_{ref}$ is stored. The DC coefficients of $MB_{ref}$ can be calculated
from the DCT coefficients of four MBs, $MB_1$, $MB_2$, $MB_3$, and $MB_4$ (see [16] for the
details of the calculation). To avoid expensive computation, the DC coefficient alone
is approximated by a weighted sum of the DC coefficients of the four MBs, with the
weights being the fractions of the areas of these MBs that overlap the reference MB,
i.e.,

\[ DC(MB_{ref}) = \sum_{i=1}^{4} w_i \times DC(MB_i) \]

where $w_i$ is given by the ratio of the area of the shaded region of $MB_i$ to its total
area.

Similarly,  if  an  MB  in  a  B  frame  is  interpolated  from  two  reference  MBs,  its  DC 
coefficient  is  approximated  by  an  average  of  the  estimated  DC  coefficients  of  each  of 
these  two  MBs. 
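
The area-weighted approximation can be sketched in a few lines. The sketch makes several assumptions that are not taken from the text: one DC value per macroblock stored in a row-major grid `dc`, motion vectors given in whole pixels and pointing from the current MB to its reference area, and a reference area that lies entirely inside the frame. The function names are hypothetical.

```python
def estimate_reference_dc(dc, mb_col, mb_row, mv, mb_size=16):
    """Approximate the DC value of the reference area of one macroblock.

    dc[r][c] is the DC value of macroblock (r, c) in the reference frame;
    mv = (x, y) is the motion vector in whole pixels (an assumption).
    """
    px = mb_col * mb_size + mv[0]   # top-left corner of the reference area
    py = mb_row * mb_size + mv[1]
    c0, ox = divmod(px, mb_size)    # first overlapped column and offset into it
    r0, oy = divmod(py, mb_size)    # first overlapped row and offset into it
    estimate = 0.0
    for r, h in ((r0, mb_size - oy), (r0 + 1, oy)):
        for c, w in ((c0, mb_size - ox), (c0 + 1, ox)):
            if w * h:               # skip macroblocks with no overlap
                estimate += (w * h) / (mb_size * mb_size) * dc[r][c]
    return estimate

def estimate_interpolated_dc(dc_from_previous, dc_from_next):
    """B-frame case: average the estimates from the two reference MBs."""
    return (dc_from_previous + dc_from_next) / 2.0

# Hypothetical 2x2 grid of DC values in the reference frame.
grid = [[10.0, 20.0], [30.0, 40.0]]
centered = estimate_reference_dc(grid, 0, 0, (8, 8))   # equal overlap with all four MBs
shifted = estimate_reference_dc(grid, 0, 0, (4, 0))    # overlaps only the top two MBs
```

The weights are the overlap areas divided by the macroblock area, so they always sum to one; with zero motion the estimate is simply the DC value of the co-located MB.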


4.2  Flow  estimation 

An  MB  can  have  zero,  one,  or  two  motion  vectors  depending  on  its  frame  type  and 
whether  the  MB  is  intra-coded,  forward-  or  backward-predicted,  or  bidirectionally- 
predicted,  respectively.  Moreover,  the  motion  vectors  of  a  given  frame  can  be  forward- 
predicted  or  backward-predicted  with  respect  to  a  reference  frame  which  may  or  may 
not occur adjacent to it. A problem occurs if, for example, we wish to compare an I
frame with no motion vectors to a B frame with primarily bidirectionally-predicted
MBs, or even two B frames, one of which is primarily forward-predicted and the other
primarily backward-predicted. We therefore require a more uniform set of motion
vectors, independent of the frame type and the direction of prediction.

Our approach involves representing each motion vector as a backward-predicted
vector  with  respect  to  the  next  frame,  independent  of  frame  type.  The  set  of  motion 
(or  “flow”)  vectors  for  each  frame  then  represents  the  direction  of  motion  of  each  MB 
with  respect  to  the  next  frame. 

It should be noted that not all MBs will have this flow vector associated with them,
but the number of such MBs is rarely large enough to affect our analysis. Across shot
cuts or breaks, most of the MBs are not expected to have flow information.

The  first  step  in  deriving  the  flow  is  to  analyze  the  frame-type  pattern  (i.e.,  the 
pattern  of  I,  P,  and  B  frames)  in  the  MPEG  stream.  If  the  video  is  in  XING  format, 
i.e.,  it  contains  only  I  frames,  then  there  exists  no  motion  information  and  this  analysis 
is  not  relevant. 

For clips containing only I and P frames: If there are only P frames between
I frames, and there are no B frames, then flow can be derived for each of the frames
between two consecutive I frames, including the I frames themselves, except for the
last P frame, for which we have no information about its relationship with the I frame
that follows it.

The flow for an I frame that is followed by a P frame is the set of forward-predicted
motion vectors of the P frame after inversion. Intuitively, if an MB in the P frame
is displaced by a motion vector (x, y) with respect to an MB in the I frame, then it
is logical to conjecture that the latter MB is displaced by a motion vector (-x, -y)
with respect to the MB in the P frame. The same reasoning is applied to the flow
estimation of the MBs of a P frame that is followed by another P frame. The MPEG
stream, however, does not contain any information relating a P frame to the I frame
that follows it, unless B frames are present between them.

For clips containing B frames: Most MPEG streams contain B frames between
consecutive reference frames. Let us consider two consecutive reference frames, $R_i$
and $R_j$. Define domain $\mathcal{D}$ to be the set containing all B frames between the two
consecutive reference frames $R_i$ and $R_j$, and the frame $R_j$ itself:

$$R_i \; \underbrace{B_1 B_2 \ldots B_n R_j}_{\text{domain } \mathcal{D}}$$

Let the B frames in the domain be denoted by $B_1, \ldots, B_n$, where $n$ is the number
of B frames between these two reference frames (typically, $n = 2$). The first step is to
derive the flow between the first reference frame $R_i$ and its next frame $B_1$ using the
forward-predicted motion vectors of $B_1$. This case is similar to the I-P case discussed
above. The inverses of the forward-predicted motion vectors form the flow vectors
for the MBs of $R_i$. Similarly, using the backward motion vectors of frame $B_n$ with
respect to $R_j$, the flow is derived for frame $B_n$. There is no need to invert the motion
vectors here, since the flow vectors essentially are backward-predicted vectors. Flow
for $R_j$ will be derived when $R_j$ is analyzed with the reference frame following $R_j$.

We have not yet considered the case where an MB in $B_1$ does not have a forward-predicted
motion vector with respect to $R_i$, or the case where an MB in $B_n$ does
not have a backward-predicted motion vector with respect to $R_j$. In the former
case, we look at the next frames successively until we find a frame, say $B_k$, in which
the corresponding MB has a valid forward-predicted motion vector, and we use the
inverse of that vector. Since this vector is predicted from $k$ frames earlier, we scale
it down by a factor of $k$. If we are not able to find such a B frame, we tag that
flow vector as undefined. Similarly, in the latter case, we look at the previous frames
successively until we find a B frame with a valid backward-predicted motion vector,
which is similarly scaled down by the number of frames over which it was predicted
before being assigned.
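
The inversion and scale-down rules for the MBs of the first reference frame can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the list-based layout (one entry per B frame, `None` for a missing vector) are assumptions made for the example.

```python
from typing import List, Optional, Tuple

Vector = Tuple[float, float]


def flow_for_reference_mb(forward_vectors: List[Optional[Vector]]) -> Optional[Vector]:
    """Estimate the flow of one MB of R_i towards the next frame.

    forward_vectors[k - 1] is the forward-predicted vector (relative to R_i)
    of the corresponding MB in B_k, or None if that MB has no such vector.
    This layout is hypothetical and chosen only for illustration.
    """
    for k, vec in enumerate(forward_vectors, start=1):
        if vec is not None:
            # Invert the vector, then scale it down by the k frames it spans.
            return (-vec[0] / k, -vec[1] / k)
    return None  # no usable vector found: the flow is tagged as undefined


# The MB has no forward vector in B_1, but has (4, -6) in B_2.
print(flow_for_reference_mb([None, (4.0, -6.0)]))
```

When the first B frame already carries a forward vector, k is 1 and the function reduces to the plain inversion used for the I-P case.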

The next step is to determine the flow between consecutive B frames in the domain.
Obviously, there is no direct interaction between such consecutive B frames in the
MPEG stream, in contrast to the aforementioned flow derivation step involving a
reference frame.

Flow between successive B frames is derived by analyzing corresponding MBs
in those B frames and their motion vectors with respect to their reference frames.
We want to find the vector from an MB in one B frame, say $B_1$, to the corresponding
MB in the next B frame, say $B_2$. Since each MB in each B frame can
be of one of three types¹, namely forward-predicted (F), backward-predicted (B), or
bidirectionally-predicted (D), there exist nine possible combinations. We can represent
these nine pairs by FF, FB, FD, BF, BB, BD, DF, DB, and DD. Each of these
nine combinations is considered individually, and flow is estimated between them by
analyzing each of the motion vectors with respect to the reference frame.

Let the forward-predicted motion vector of the current MB in $B_1$ be denoted by
$\overrightarrow{B_1R_i}$, and let the forward-predicted motion vector of the corresponding MB in $B_2$ be
denoted by $\overrightarrow{B_2R_i}$. If we denote the flow between the MBs of the B frames by $\overrightarrow{B_1B_2}$,
then we have the relationship

$$-\overrightarrow{B_2R_i} = -\overrightarrow{B_1R_i} + \overrightarrow{B_1B_2}$$

from which $\overrightarrow{B_1B_2}$ can be obtained easily. Since only the forward-predicted motion
vectors $\overrightarrow{B_1R_i}$ and $\overrightarrow{B_2R_i}$ are required, this covers the cases of the MB pair having
patterns FF, FD, DF, or DD.

Similarly, if the MB pair has pattern BB, BD, or DB, we can find $\overrightarrow{B_1B_2}$ by using
the backward-predicted motion vectors of $B_1$ and $B_2$ with respect to $R_j$. Let the
backward-predicted motion vector of the current MB in $B_1$ be denoted by $\overrightarrow{B_1R_j}$, and
let the backward-predicted motion vector of the corresponding MB in $B_2$ be denoted
by $\overrightarrow{B_2R_j}$. Then we have

$$\overrightarrow{B_1R_j} = \overrightarrow{B_1B_2} + \overrightarrow{B_2R_j}$$

from which $\overrightarrow{B_1B_2}$ can be obtained easily.

¹Skipped MBs are also actually either forward, backward, or bidirectionally predicted. Intra-coded
MBs have no motion vectors and are therefore not considered, and their flow vectors are
undefined.

The only remaining cases to be considered are FB and BF. Clearly, for the case of
FB, the flow $\overrightarrow{B_1B_2}$ is undefined, because this pattern is an indication of the presence
of a cut between the two B frames.

For the BF case, we first find the flow vector $\overrightarrow{R_iB_1}$ for the corresponding MB of $R_i$
using the scale-down technique explained above. Then, using the forward-predicted
motion vector of $B_2$, we calculate $\overrightarrow{B_1B_2}$ as

$$-\overrightarrow{B_2R_i} = \overrightarrow{R_iB_1} + \overrightarrow{B_1B_2}$$

Similarly, we find the flow vector $\overrightarrow{B_2R_j}$ for the MB in $B_2$ using the scale-down technique.
Using the backward-predicted motion vector of $B_1$, $\overrightarrow{B_1R_j}$, we then calculate
$\overrightarrow{B_1B_2}$ as

$$\overrightarrow{B_1R_j} = \overrightarrow{B_1B_2} + \overrightarrow{B_2R_j}$$

Since the vectors have been estimated with respect to the reference frames using
the scale-down technique, we take the average of these two vectors to yield a better
estimate of the actual $\overrightarrow{B_1B_2}$.
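
The nine macroblock-type combinations can be collected into a single function, sketched below. The argument names are hypothetical, `None` stands for a vector that is absent, and returning `None` when the BF case lacks its scale-down estimates is an assumption of this sketch rather than something the text specifies.

```python
from typing import Optional, Tuple

Vector = Tuple[float, float]


def sub(a: Vector, b: Vector) -> Vector:
    return (a[0] - b[0], a[1] - b[1])


def flow_b1_to_b2(
    b1_fwd: Optional[Vector],  # B1Ri: forward vector of the MB in B1
    b1_bwd: Optional[Vector],  # B1Rj: backward vector of the MB in B1
    b2_fwd: Optional[Vector],  # B2Ri: forward vector of the MB in B2
    b2_bwd: Optional[Vector],  # B2Rj: backward vector of the MB in B2
    ri_flow: Optional[Vector] = None,   # RiB1 from the scale-down step (BF only)
    b2_to_rj: Optional[Vector] = None,  # B2Rj from the scale-down step (BF only)
) -> Optional[Vector]:
    if b1_fwd is not None and b2_fwd is not None:
        # FF, FD, DF, DD: -B2Ri = -B1Ri + B1B2, so B1B2 = B1Ri - B2Ri
        return sub(b1_fwd, b2_fwd)
    if b1_bwd is not None and b2_bwd is not None:
        # BB, BD, DB: B1Rj = B1B2 + B2Rj, so B1B2 = B1Rj - B2Rj
        return sub(b1_bwd, b2_bwd)
    if b1_fwd is not None and b2_bwd is not None:
        # FB: indicates a cut between the two B frames, flow is undefined
        return None
    if b1_bwd is not None and b2_fwd is not None:
        # BF: combine two estimates that rely on scaled-down vectors
        if ri_flow is None or b2_to_rj is None:
            return None  # assumption: undefined when an estimate is missing
        est_fwd = (-b2_fwd[0] - ri_flow[0], -b2_fwd[1] - ri_flow[1])
        est_bwd = sub(b1_bwd, b2_to_rj)
        return ((est_fwd[0] + est_bwd[0]) / 2.0, (est_fwd[1] + est_bwd[1]) / 2.0)
    return None  # at least one MB is intra-coded


print(flow_b1_to_b2((4.0, 0.0), None, (6.0, 0.0), None))  # FF pair
```

The order of the checks matters: testing the forward pair first sends DD, FD, and DF to the forward relationship, as the text prescribes, and only the pure FB and BF pairs reach the last two branches.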

It should be mentioned that it may not always be appropriate to use the vectors of
the same corresponding MBs over the B frames and the reference frames. Consider,
for example, Figure 5. We wish to calculate the flow of $(B_1)_{l,m}$, where $l$ and $m$ denote
the indices of the current MB in the array of MBs. The forward-predicted vector of
$B_2$ is large enough that its reference 16x16 area is from another adjacent MB, with
indices $l - 1$, $m - 1$. We then use the flow of $(R_i)_{l-1,m-1}$ instead of the flow of $(R_i)_{l,m}$,
which would not be proper. We assume that the need to use vectors of alternate MBs
arises only when vectors have been predicted over more than one frame, i.e., they are
not predicted over adjacent frames. We assume that using corresponding MBs of $B_1$
and $B_2$ is sufficient.

Figure 5: Flow estimation: The MB in $B_2$ is forward-predicted from an adjacent MB
in $R_i$ whose flow is used to calculate the flow of the middle MB in $B_1$.

Accuracy of computation: To evaluate the accuracy of the estimation, we must
provide ground truth and compare it to the results from the flow estimation step.
Using the original uncompressed image frames, we encode the frames into MPEG
files in which all B frames are replaced by P frames, using a widely available MPEG
encoder called mpeg_encode developed by the Plateau Multimedia Research Group
at the University of California in Berkeley [4]. IBBPBB ordering, for example, then
becomes IPPPPP ordering. Hence every frame is a reference frame and all P frames
are predicted from their respective previous reference frames, I or P. The I frames,
on the other hand, are not related to any previous frames, and therefore the last P
frame occurring before an I frame has no flow information. Nonetheless, using the
other frames, we are still able to obtain a good evaluation of the results of the flow
estimation process.

We apply the flow estimation step to the files in IPPPPP format, and we compare
the flow vectors of the frames of the two encodings in three ways. First, we
quantize the vectors of the two encodings in the four principal directions and compare
the directions of the corresponding MBs. Second, we compare the angles the
corresponding flow vectors make with the positive x-axis, and determine the average
difference in angle between the vectors. Third, we determine the average magnitude
of the vector difference of two corresponding vectors in pixel units. The ratio of
this average difference vector magnitude to the average magnitude of the flow vectors
gives a metric for our evaluations. Due to imperfections during encoding of the
MPEG video, experimental results show that noise is frequently present in the motion
vectors. Full search during the block matching phase of the encoding process is
very time-consuming. Therefore, to exclude the noise, we discard the top 15% of the
magnitudes and the angle differences, and only consider the remainder for evaluation.
The results of the three experiments are summarized in the three rows of Table 1.
The results in the first column are for all the frames of our test clips, whereas the
results in the second column are for the frames that belong to sequences in motion
classes such as zooms, pans, and tilts.

Table 1: Results of the flow estimation process.

Results of flow estimation    all frames    frames in valid motion sequences only
matching directions (%)       70 %          71 %
angle difference (deg)        5.f°          5.4°
pixel difference ratio        0.36          0.33
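
The three comparisons can be sketched as follows. Several details are assumptions made for illustration and are not stated in the text: the four sectors are centred on the axes, the y-axis points up, zero vectors fall in sector 0, and the pixel difference ratio is normalized by the average magnitude of the reference (IPPPPP) vectors, which must be non-zero.

```python
import math
from typing import List, Tuple

Vector = Tuple[float, float]


def principal_direction(v: Vector) -> int:
    """Quantize into 0=right, 1=up, 2=left, 3=down (90-degree sectors)."""
    angle = math.degrees(math.atan2(v[1], v[0])) % 360.0
    return int(((angle + 45.0) % 360.0) // 90.0)


def angle_difference(a: Vector, b: Vector) -> float:
    """Absolute difference, in degrees, of the angles made with the positive x-axis."""
    da = math.degrees(math.atan2(a[1], a[0]))
    db = math.degrees(math.atan2(b[1], b[0]))
    diff = abs(da - db) % 360.0
    return min(diff, 360.0 - diff)


def trimmed_mean(values: List[float], discard_top: float = 0.15) -> float:
    """Average after discarding the largest `discard_top` fraction of the values."""
    n_drop = int(len(values) * discard_top)
    kept = sorted(values)[: len(values) - n_drop]
    return sum(kept) / len(kept)


def compare_flows(estimated: List[Vector], reference: List[Vector]):
    pairs = list(zip(estimated, reference))
    matches = sum(principal_direction(e) == principal_direction(r) for e, r in pairs)
    mean_angle = trimmed_mean([angle_difference(e, r) for e, r in pairs])
    mean_diff = trimmed_mean([math.hypot(e[0] - r[0], e[1] - r[1]) for e, r in pairs])
    mean_mag = sum(math.hypot(r[0], r[1]) for r in reference) / len(reference)
    return 100.0 * matches / len(pairs), mean_angle, mean_diff / mean_mag


est = [(1.0, 0.2), (0.1, 2.0), (-1.5, 0.0), (0.0, -1.0)]
ref = [(1.0, 0.0), (0.0, 2.0), (-1.0, 0.0), (1.0, 0.0)]
print(compare_flows(est, ref))
```

The three returned values correspond to the three rows of Table 1: percentage of matching directions, average angle difference in degrees, and pixel difference ratio.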

The  results  from  the  test  involving  only  the  frames  in  valid  motion  sequences 
(column  2)  are  marginally  better  than  those  from  the  test  involving  all  the  frames 
(column  1)  because  the  flow  vectors  are  more  organized  due  to  the  distinct  motion, 
and  during  encoding,  the  block  matching  usually  is  not  very  flexible.  In  uniformly 
textured  backgrounds,  or  in  frames  with  no  motion  or  irregular  motion,  the  flow  can 
be  predicted  from  different  directions. 

Figure 6(a) shows a plot of the flow vector angle differences between the IBBPBB
and the IPPPPP encodings in a typical frame after sorting in ascending order of
angle differences. Figure 6(b) shows how the average angle difference varies with
the percentage of highest angle differences omitted from the calculation. Since the
number of angle differences with large magnitudes is relatively low, the average angle
difference drops rather quickly with increasing percentage of omitted angle differences.

Figure 6: Flow estimation plots: (a) A sorted plot of the flow vector angle differences
between the two encodings of the same clip. (b) A plot showing the relationship
between the average flow vector angle difference and the percent of highest angle
differences omitted.

Examples of the estimated flow vectors are shown in Figures 7(a) and (b). The
first pair was taken from the beginning of a "pan left and tilt up" sequence. The
second pair was taken from a sequence in which the camera is rotating in the clockwise
direction, and is gently tilting up at the same time. For each pair, the MB
image on the left was derived from the re-encoded IPPPPP format files, and its corresponding
image on the right contains the estimated flow between two B frames
from the IBBPBB encoded format files. The shade of the MB represents the direction
of the flow vector; the ranges of directions for each shade are shown in Figure
7(c). For "zero" vectors, the shade shown in the center of the circle was used, and
for "undefined" vectors, the shade WHITE was used.

We observe that by using the flow vectors of each frame, movements of objects can
be sketched. For example, a group of adjacent MBs having similar flow vectors can be
associated with a rigid object undergoing some motion. Once a frame is segmented
into regions having different flow vectors, the individual segments can be displaced
according to their flows, and these deformed frames can be used to generate more
robust results during the retrieval phase. This is another reason we select the first
frame as a key frame of a scene.

Figure 7: Examples of flow estimation: In (a) and (b), the images on the left are taken
from the IPPPPP encoded files, and the corresponding images on the right show the
estimated flow from the IBBPBB encoded files. In (c), the shade codes depend on the
range of directions of the flow vectors shown on the circle. The shade at the center
of the circle is used for "zero" vectors, and WHITE for "undefined" vectors.

Using this frame-type-independent framework, we are able to consider videos that
have differing IPB patterns as being equivalent, and we are not constrained to comparing
videos that have the same patterns.

5  Video  Indexing  using  FastMap 

An important goal of our research is to be able to organize and retrieve video data that
a user is interested in. Now that we have extracted a uniform representation to work
on, in this section we explore techniques that can be used to organize and retrieve
video clips in the compressed domain. In the next section, we provide experimental
results to evaluate how effective these techniques are.

Given a set of video clips encoded in MPEG, we would like to index them to
allow "query-by-frame" or "query-by-sequence". These allow us to 'Retrieve the
video clips in the database most similar to this query image', or 'Return the three
video clips most similar to the given sequence of frames', respectively. As video
clips have both a spatial dimension (represented by DCT coefficients) and a temporal
dimension (represented by motion vectors), we propose to use both sets of information
to determine the correct answer to the query. If the query consists of a single frame, of
course, no temporal information can be derived to match with the motion information
of the videos stored in the database. In our approach, the key frame of each scene
serves as the index for that scene, with its DC coefficients representing the spatial
dimension and its flow vectors representing its temporal dimension.

Using the DC coefficients of all the MBs of a frame alone leads to large feature
vectors, and standard database techniques become impractical. Therefore, we use a
technique called FastMap to reduce the size of the feature vectors to a manageable
level, and use these low-dimensional feature vectors for indexing. The primary advantage
of FastMap is that it runs in time linear in the number of objects in the
database.

5.1  Spatial  or  frame  similarity 

Spatial  similarity  between  two  frames  implies  that  the  frames  have  similar  spatial 
properties such as luminance, chrominance, texture, and shape. Testing for this similarity
involves comparing values that represent those properties. In the pixel domain,
the color and luminance values are represented by the values associated with a pixel,
but in the compressed domain, the 64 DCT coefficients together represent the values
for an 8x8 block of pixels. The DC coefficient alone specifies the average intensity
value  for  that  block.  Since  we  do  not  want  to  decompress  individual  frames,  we  use 
the  DC  coefficients  of  the  luminance  and  chrominance  components  and  compare  them 
with  the  DC  coefficients  of  the  corresponding  blocks  of  other  frames  to  test  for  spatial 
similarity. 

One approach to computing spatial similarity is to store all the DCT information
for every frame of every clip, since the DCT information provides a reasonable abstraction
of the spatial information. When a query frame arrives, we can compare it
with  all  other  available  frames  in  order  to  determine  the  most  similar  one.  However, 
this  approach  is  not  efficient  in  either  time  or  space.  Since  most  of  the  frames  are 
similar  to  the  frames  adjacent  to  them,  large  amounts  of  redundancy  exist.  We  would 
like  to  use  properties  of  the  video  clips  to  search  only  in  a  small  subset  of  the  frames 
of  a  clip,  and  still  generate  robust  matches. 

As  stated  earlier,  each  clip  can  be  divided  up  into  shots  and  then  into  scenes, 
with  each  scene  denoting  a  basic  coherent  sequence  of  similar  frames.  A  key  frame  is 
chosen  for  each  scene  and  that  frame  is  used  to  represent  that  scene.  The  question 
is then how to determine similarity between frames. One approach is to compare the
corresponding DCT coefficients of the frames directly, since in the compressed domain,
the DCT coefficients best represent the spatial information of the frames. A simpler
approach is to treat each frame as a vector of DC coefficients alone (as opposed to all
64 DCT coefficients) and use the Euclidean distance between the vectors to determine
the similarity of the frames. This is accomplished as follows.

In a video database, each key frame can be represented as a point in an N-dimensional
Euclidean space, where N is the number of DC coefficients. For a
320x240 frame, there are 300 MBs, and if we use the six DC coefficients that exist
in each MB, then N is 1800. Traditional multi-dimensional indexing techniques
like R-trees tend to be very inefficient in such high-dimensional spaces. Thus we need
to have a way of reducing the dimensionality of the points to a manageable level while
maintaining the proximity of the points (and thus the similarity).
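
As a small check of the arithmetic above, the following sketch computes N for a given frame size and the Euclidean distance between two DC-coefficient vectors. The function names are hypothetical, and the default of six DC coefficients per 16x16 MB assumes four luminance blocks plus two chrominance blocks.

```python
import math
from typing import Sequence


def num_dc_features(width: int, height: int, mb_size: int = 16, dc_per_mb: int = 6) -> int:
    """Number of DC coefficients N for a frame made of mb_size x mb_size MBs."""
    return (width // mb_size) * (height // mb_size) * dc_per_mb


def frame_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two frames given as DC-coefficient vectors."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of DC coefficients")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(num_dc_features(320, 240))             # 20 x 15 MBs, six coefficients each
print(frame_distance([0.0, 3.0], [4.0, 0.0]))
```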

To achieve this, we use the FastMap algorithm described by Faloutsos and Lin
[6]. FastMap takes as input a distance function between key frames and outputs a
point in a low-dimensional space for every key frame in linear time. The main characteristic
of FastMap is that the output points tend to approximate well the relative
distance between the original key frames while keeping the number of dimensions to
a manageable level.

The basic idea is that FastMap assumes the objects do indeed lie in a certain
unknown, $k$-dimensional space. The goal is to recover the values of each dimension,
given only the distances between the 'points'. This is achieved through the use of
projection: we choose two objects $O_a$ and $O_b$ (referred to as 'pivot objects' from now
on), and consider the 'line' that passes through them. We project all the points onto
this line. Since we have the distance information between the points, we utilize the
cosine law to recover the coordinates, as shown in Figure 8(a).

Given an object $O_i$ to be projected, we consider the triangle formed by $O_a$, $O_b$,
and $O_i$. The coordinate $x_i$ for object $O_i$ is equal to the length of the line segment $O_aE$.
Applying the cosine law gives

$$x_i = \frac{d_{a,i}^2 + d_{a,b}^2 - d_{b,i}^2}{2\,d_{a,b}} \qquad (1)$$

where $d_{i,j}$ is the distance $\mathcal{D}(O_i, O_j)$ between two objects. Note that the computation
of $x_i$ requires only the distances between objects, which are given. Observe that,
based on Equation 1, we can map objects into points on a line, preserving some of
the distance information. Thus, we have solved the problem for $k = 1$.
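
A short numerical check of Equation 1, using three points in the plane whose coordinates are made up for the example: with $O_a = (0, 0)$ and $O_b = (4, 0)$ the pivot line is the x-axis, so the projection of $O_i = (1, 2)$ should come out as 1.

```python
import math


def project_on_pivot_line(d_ai: float, d_bi: float, d_ab: float) -> float:
    """Equation 1: coordinate of O_i along the line through pivots O_a and O_b."""
    return (d_ai ** 2 + d_ab ** 2 - d_bi ** 2) / (2.0 * d_ab)


o_a, o_b, o_i = (0.0, 0.0), (4.0, 0.0), (1.0, 2.0)
x_i = project_on_pivot_line(
    math.dist(o_a, o_i),  # d_{a,i}
    math.dist(o_b, o_i),  # d_{b,i}
    math.dist(o_a, o_b),  # d_{a,b}
)
print(x_i)
```

Only the three pairwise distances enter the computation; the coordinates are used here solely to produce those distances.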


Figure 8: Illustration for FastMap: (a) The cosine law. (b) Projection onto the
perpendicular hyperplane.

In order to map the objects into a multi-dimensional space, we need to extend the
method so that more coordinates can be generated. Once again, we assume the points
to be lying in a $k$-dimensional space (Figure 8(b)). After the first step of FastMap, we
have found a line $(O_a, O_b)$ on which we can project all the points. Consider a $(k-1)$-dimensional
hyperplane $\mathcal{H}$ perpendicular to that line. If we project all the points
onto this plane, the points lie in a $(k-1)$-dimensional space. Let $O_i'$ stand for the
projection of $O_i$ (for $i = 1, \ldots, n$). Thus, the problem is transformed to one of finding
the coordinates for objects on the hyperplane $\mathcal{H}$. This is the same as the original
problem, with $k$ decreased by 1. Once we obtain the projected distances between the
points, we recursively apply this algorithm to generate the next coordinate.

To obtain the projected distances $\mathcal{D}'()$ between the points on the hyperplane $\mathcal{H}$,
consider the triangle $(C, O_i, O_j)$ in Figure 8(b). The Pythagorean theorem gives the
following equation from which the projected distances are calculated:

$$(\mathcal{D}'(O_i', O_j'))^2 = (\mathcal{D}(O_i, O_j))^2 - (x_i - x_j)^2, \qquad i, j = 1, \ldots, n \qquad (2)$$

Although we indicated that we were projecting onto a $(k-1)$-dimensional space,
note that initially $k$ is arbitrary. We can repeat the above procedure to generate
coordinates in any number of dimensions.

For  each  new  axis/dimension,  the  basic  steps  of  the  algorithm  are: 

1.  Pick  two  pivot  objects  (as  far  apart  as  possible). 

2. Compute the projection of each object on the 'line' defined by the pivot objects.

Note that the second step takes linear time. In order for the algorithm to run fast,
we need to ensure that the first step also takes linear time. Thus we need a heuristic
that can select the pivots in linear time. We would like the pivots to be far apart,
however, so that the objects will be more spread out along the projection axis. To
avoid costly $O(n^2)$ algorithms, we use a linear time heuristic: starting with a point,
pick the point that is farthest away from it. Then use this new point, and repeat this
heuristic. Thus the complexity of FastMap is $O(n)$ in the number of objects $n$. The
reader can refer to the paper on FastMap [6] for more information, including the
pseudo-code of the algorithm.

In retrieval, it is necessary to map new objects (like queries) onto the space formed
by FastMap. This can be done efficiently. Since we can retain the pivots picked by
FastMap, we can calculate the projection of the new point onto each axis to obtain
its coordinates.
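
The mapping and the projection of a new query can be sketched together as below. This is an illustrative reading of the steps described above, not the pseudo-code from [6]: the function names are hypothetical, the pivot search always starts from object 0 and makes a single farthest-point hop in each direction, and residual squared distances are clamped at zero to guard against rounding.

```python
import math
from typing import Callable, List, Sequence, Tuple

Obj = Sequence[float]


def euclidean(a: Obj, b: Obj) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fastmap(objects: List[Obj], k: int, dist: Callable = euclidean):
    """Map each object to k coordinates; also return the pivot pair of each axis."""
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]
    pivots: List[Tuple[int, int]] = []

    def d2(i: int, j: int, axis: int) -> float:
        # Equation 2 applied once per axis already generated.
        full = dist(objects[i], objects[j]) ** 2
        used = sum((coords[i][t] - coords[j][t]) ** 2 for t in range(axis))
        return max(full - used, 0.0)

    for axis in range(k):
        # Linear-time pivot heuristic: farthest from a start point, then farthest from that.
        b = max(range(n), key=lambda i: d2(0, i, axis))
        a = max(range(n), key=lambda i: d2(b, i, axis))
        d_ab2 = d2(a, b, axis)
        if d_ab2 == 0.0:
            break  # no spread left; remaining coordinates stay zero
        pivots.append((a, b))
        d_ab = math.sqrt(d_ab2)
        for i in range(n):
            # Equation 1 on the residual distances.
            coords[i][axis] = (d2(a, i, axis) + d_ab2 - d2(b, i, axis)) / (2.0 * d_ab)
    return coords, pivots


def project_query(query: Obj, objects, coords, pivots, dist: Callable = euclidean):
    """Coordinates of a new object, using the pivots retained from fastmap()."""
    q = [0.0] * len(coords[0])

    def d2_to(i: int, axis: int) -> float:
        full = dist(query, objects[i]) ** 2
        used = sum((q[t] - coords[i][t]) ** 2 for t in range(axis))
        return max(full - used, 0.0)

    for axis, (a, b) in enumerate(pivots):
        full = dist(objects[a], objects[b]) ** 2
        used = sum((coords[a][t] - coords[b][t]) ** 2 for t in range(axis))
        d_ab2 = full - used
        q[axis] = (d2_to(a, axis) + d_ab2 - d2_to(b, axis)) / (2.0 * math.sqrt(d_ab2))
    return q


frames = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (4.0, 3.0), (2.0, 1.0)]
coords, pivots = fastmap(frames, k=2)
print(project_query((2.0, 1.0), frames, coords, pivots))
```

Each axis costs a constant number of passes over the objects, and a query needs only its distances to the two pivots of each axis, which is why retaining the pivots is sufficient for retrieval.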

Using  FastMap,  together  with  the  Euclidean  distance  function,  we  can  organize  the 
frames  in  an  efficient  spatial  data  structure  and  retrieve  nearest  neighbors  efficiently. 

5.2  Temporal  similarity 

As stated earlier, MPEG streams provide motion information as part of the encoding.
Many clips have shots of fairly similar content, e.g., conversational scenes. Key frames
can be generated for the same content or action occurring at different points in a clip.
If the query comes in the form of a video clip, e.g., 'Retrieve video clips that might
contain this short video scene', we can also compare the motion information. We can
consider the corresponding macroblocks of the query key frame, and the set of key
frame candidates short-listed by the technique presented in the previous section, to
find the candidate key frame undergoing the most similar motion, and return that
key frame as the best match.

Our focus is primarily on the direction of the motion and not the magnitude,
which  gives  us  a  measure  less  susceptible  to  noise  and  minor  changes.  For  each 
frame  to  be  indexed,  we  identify  the  direction  of  the  flow  vectors  and  quantize  them 
angularly  into  eight  bins.  We  compare  the  flows  of  corresponding  blocks  and  use 
the  number  of  corresponding  blocks  that  have  the  same  flow  direction  as  a  second 
measure  of  similarity. 
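
The direction-based measure can be sketched as follows. The bin layout (eight 45-degree sectors centred on the axes and diagonals) and the choice to treat zero and undefined vectors as having no direction are assumptions of this sketch; the text only specifies angular quantization into eight bins.

```python
import math
from typing import List, Optional, Tuple

Vector = Tuple[float, float]


def direction_bin(v: Optional[Vector], bins: int = 8) -> Optional[int]:
    """Angular bin of a flow vector, or None for zero or undefined vectors."""
    if v is None or (v[0] == 0 and v[1] == 0):
        return None
    width = 2.0 * math.pi / bins
    angle = math.atan2(v[1], v[0]) % (2.0 * math.pi)
    return int(((angle + width / 2.0) % (2.0 * math.pi)) // width)


def matching_directions(query_flow: List[Optional[Vector]],
                        key_flow: List[Optional[Vector]]) -> int:
    """Number of corresponding blocks whose flows fall into the same bin."""
    count = 0
    for q, k in zip(query_flow, key_flow):
        bq, bk = direction_bin(q), direction_bin(k)
        if bq is not None and bq == bk:
            count += 1
    return count


query = [(1.0, 0.0), (0.0, 1.0), (0.0, 0.0), None]
key = [(2.0, 0.1), (1.0, 0.0), (0.0, 0.0), (1.0, 1.0)]
print(matching_directions(query, key))
```

Because only the bin index is compared, vectors of very different magnitude still match as long as they point in roughly the same direction.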

6  Experiments  and  Results 

Our  system  combines  both  spatial  and  temporal  similarity  techniques  to  provide  a 
simple  and  efficient  method  of  indexing.  First,  we  index  into  the  key  frames  using  the 
vectors  generated  by  FastMap  from  the  DC  coefficients.  These  DC  coefficients  are 
readily available if the query frame is an I frame of a video clip, or a JPEG image.
If it is a B or P frame, the coefficients are estimated using the technique referred to
earlier, or if it is an image in a different format, it can be converted to a JPEG image.
If  a  query  consists  of  only  one  frame,  we  use  the  index  of  the  FastMap  vector  to  locate 
the  key  frame  which  is  most  similar  to  the  query.  We  can  also  return  the  first  few 
most  similar  frames  and  let  the  user  browse  the  results.  In  the  case  of  a  short  query 
sequence,  we  ask  one  query  for  each  frame  of  the  query  sequence,  and  tabulate  the 
votes  to  identify  the  winners.  These  key  frames  are  treated  as  candidates,  and  we 
compare  any  motion  information  to  modify  their  ranking. 
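
The voting step for a query sequence, followed by motion-based re-ranking, might look like the sketch below. The names `nearest_key_frames` and `motion_score` are hypothetical placeholders for the FastMap index lookup and the flow-direction comparison, and giving one vote per retrieved neighbour is an assumption.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, List


def vote_for_key_frames(query_frames: Iterable,
                        nearest_key_frames: Callable[[object], List[Hashable]],
                        top: int = 3) -> List[Hashable]:
    """Pose one query per frame and return the key frames with the most votes."""
    votes: Counter = Counter()
    for frame in query_frames:
        for key_id in nearest_key_frames(frame):
            votes[key_id] += 1
    return [key_id for key_id, _ in votes.most_common(top)]


def rerank_by_motion(candidates: List[Hashable],
                     motion_score: Callable[[Hashable], float]) -> List[Hashable]:
    """Reorder candidates by motion similarity; ties keep their vote order."""
    return sorted(candidates, key=lambda key_id: -motion_score(key_id))


# Stub index: each query frame maps to its nearest key frames.
lookup = {"q1": ["k7", "k2"], "q2": ["k7"], "q3": ["k2", "k7"]}
candidates = vote_for_key_frames(["q1", "q2", "q3"], lookup.__getitem__, top=2)
print(rerank_by_motion(candidates, {"k7": 10, "k2": 40}.get))
```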

For the experiments, we used a total of 30 videos containing approximately 15,000
frames digitized at frame rates varying from 5 to 30 frames per second (fps). A total
of 329 key frames were identified. For the experiments, we used the first I frame in a
subshot or scene as a key frame, instead of the first frame regardless of the frame type,
because the I frames do not have the estimated DC coefficients that P and B frames
have. This enabled us to study the effectiveness of the method under ideal conditions
instead of having noise or artifacts from the DC estimation hinder our evaluation.
The clips can roughly be divided into five categories: sports clips, news clips, movie
clips, commercial clips, and outdoor surveillance clips. The five groups of videos
have different visual properties. For example, the sports clips have predominantly a
green grassy field in the background and involve large amounts of motion, while the
newscast clips feature people speaking in front of a (generally) dark background with
very little motion. The surveillance videos were taken in bright daylight outdoors.

Clustering: Figure 9 shows the clustering of the key frames achieved in a FastMap
space of three dimensions. The key frames in Figure 9(a) were taken from different
clips of the same movie and cluster well. Figure 9(b) shows three distinct clusters
of key frames, with the sparse clusters at the left and middle taken from one movie.
The two clips that form those clusters are of the same two scenes with each scene
forming one cluster. The large cluster at the right is composed of key frames taken
from four news interview clips of essentially the same content. Figure 10(a) shows the
grouping of the key frames of the surveillance shots, and Figure 10(b) shows primarily
two clusters, with the smaller cluster containing key frames from a football game on
natural grass, and the larger cluster containing key frames from another football game
on astroturf and a baseball game. In Figures 9 and 10, the same scale is used for each
of the three axes.

Figure 9: Key frame clusterings. (a) Plotted series: ‘apollo_1.dat’, ‘apollo_2.dat’,
‘apollo_3.dat’, ‘apollo_4.dat’. (b) Plotted series: ‘jp2.dat’, ‘jp3.dat’, ‘news1.dat’,
‘news2.dat’, ‘news3.dat’, ‘news4.dat’.

Figure 10: Key frame clusterings. Plotted series: ‘video5.dat’, ‘video6.dat’,
‘video7.dat’, ‘video8.dat’, ‘video9.dat’.

Indexing and Retrieval: Any newly developed technique is incomplete without
thorough testing of its reliability and validity based on quantitative results. First, a
test needs to be performed in which, if a query that is clearly very similar to an existing
index in the database is posed to the system, the system finds the appropriate match. If
this appropriate match is known a priori, evaluating performance is straightforward.
Second, if random queries consisting of similar and dissimilar queries are posed to
the system, the system should find the best match. For this, we require a means of
determining whether the system was able to retrieve the best possible match. That
is, we require a means of determining whether the match obtained was the same as
the match an ideal retrieval system would find. We use a retrieval system employing
all the original features (all 1800 DC coefficients) and the Euclidean distance metric
as the ideal retrieval system.
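As a rough illustration of such an ideal reference system, the following sketch ranks key frames by Euclidean distance over their full feature vectors. The function name `ideal_retrieval` and the tiny three-dimensional vectors are assumptions made for illustration; in the experiments each vector would hold all 1800 DC coefficients.

```python
import math

def ideal_retrieval(query, key_frames, k=1):
    """Return the indices of the k key frames nearest to `query`.

    `query` and each entry of `key_frames` are full feature vectors
    (hypothetical stand-ins for the 1800 DC coefficients). Distances are
    plain Euclidean, with no dimensionality reduction.
    """
    ranked = sorted(range(len(key_frames)),
                    key=lambda i: math.dist(query, key_frames[i]))
    return ranked[:k]

# Toy data: three "key frames" with three features each (illustrative only).
key_frames = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [5.0, 5.0, 5.0]]
top_two = ideal_retrieval([0.9, 1.0, 1.2], key_frames, k=2)
```

Because this exhaustive search compares the query against every stored vector, it is only meant as a yardstick for judging the reduced-dimension index, not as a practical retrieval method.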

Two tests of indexing and retrieval were performed. The first test checked whether
the system retrieves the appropriate match when a query that is clearly very similar
to an existing index in the database is posed to it. This test consisted
of 329 query sequences, one for each key frame in the database, formed by taking
the six frames immediately following each key frame. In all the test clips, I frames
occurred every six frames, so we used the ‘flow’ of the sixth offset frame for matching
with the flow of the key frame. We used the pivots generated by FastMap to calculate
the coordinates of the query frames and then found the nearest key frame point(s).
A successful query should identify the key frame of the clip and retrieve the shot
from which the query frame was taken. The two parameters that were varied were
the number of dimensions FastMap produced, and the number of nearest neighbors
retrieved. Tests were carried out for FastMap dimensions 4, 6, 8, 10, and 15. The
number of nearest neighbors retrieved varied from 1 to 3 (ordered by distance to the
query point).
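The query step described above (reuse the stored pivots to place a query frame in the reduced space, then look up its nearest key frame points) can be sketched as follows. This is a minimal version of the standard FastMap projection under the assumption of Euclidean distances on the original features; the function names, the pivot-selection heuristic, and the toy two-dimensional data are illustrative and not taken from the prototype.

```python
import math

def residual_sq(u, v, xu, xv):
    """Squared distance between u and v after removing the coordinates
    already assigned (xu, xv); clamped at zero against rounding error."""
    d2 = math.dist(u, v) ** 2 - sum((a - b) ** 2 for a, b in zip(xu, xv))
    return max(d2, 0.0)

def farthest(i, points, coords):
    """Index of the point farthest from point i in the residual space."""
    return max(range(len(points)),
               key=lambda j: residual_sq(points[i], points[j], coords[i], coords[j]))

def fastmap_fit(points, dims):
    """Map the database points to at most `dims` coordinates; return the
    coordinates and the pivot pair chosen for each dimension."""
    coords = [[] for _ in points]
    pivots = []
    for _ in range(dims):
        b = farthest(0, points, coords)   # heuristic choice of a distant pair
        a = farthest(b, points, coords)
        dab2 = residual_sq(points[a], points[b], coords[a], coords[b])
        if dab2 == 0.0:                   # no spread left to capture
            break
        new = [(residual_sq(p, points[a], c, coords[a]) + dab2
                - residual_sq(p, points[b], c, coords[b])) / (2 * math.sqrt(dab2))
               for p, c in zip(points, coords)]
        pivots.append((a, b))
        for c, x in zip(coords, new):
            c.append(x)
    return coords, pivots

def fastmap_project(query, points, coords, pivots):
    """Compute the coordinates of a new query from the stored pivots."""
    xq = []
    for col, (a, b) in enumerate(pivots):
        ca, cb = coords[a][:col], coords[b][:col]
        dab2 = residual_sq(points[a], points[b], ca, cb)
        da2 = residual_sq(query, points[a], xq, ca)
        db2 = residual_sq(query, points[b], xq, cb)
        xq.append((da2 + dab2 - db2) / (2 * math.sqrt(dab2)))
    return xq

def nearest_key_frames(xq, coords, k):
    """Indices of the k database points nearest to xq, ordered by distance."""
    return sorted(range(len(coords)), key=lambda i: math.dist(xq, coords[i]))[:k]

# Toy database of five 2-D "feature vectors" (illustrative only).
points = [[0.0, 0.0], [10.0, 0.0], [0.0, 5.0], [10.0, 5.0], [5.0, 2.0]]
coords, pivots = fastmap_fit(points, dims=2)
```

Only the distances from the query to the two pivots of each dimension are needed at query time, which is why the lookup cost is small compared with a search over the original features.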

The query results can be categorized into three types.

Queries that yielded the correct answer as one of the top choices (type A). By a
correct answer, we mean that the retrieved key frame is the most recent key
frame in the temporal ordering, i.e., it is the key frame of the scene from
which the query was taken.

Queries that yielded another key frame from the same clip (type B). In this case,
the retrieved key frame was not the most recent in the temporal ordering, but
another key frame from the same set of key frames generated from that clip.
This happened primarily in two situations: (1) when there were many shots
(and thus key frames) that had exactly the same content, so that the query
found a match with one of those alternate shots instead of the shot from which
the query was taken; (2) when one continuous shot contained many scenes
(and thus key frames) due to changes in content (e.g., due to camera motion)
and the key frame of the next scene was more similar to the query than
the key frame of the current scene.

Queries that missed (type C). None of the top choices were from the same clip.
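The three result types above can be expressed as a small scoring routine. This is only a sketch of the bookkeeping: the function names, the key frame identifiers, and the `clip_of` mapping from key frame to clip are hypothetical and introduced here for illustration.

```python
from collections import Counter

def classify_result(retrieved, correct_key_frame, clip_of):
    """Label one query as type 'A', 'B', or 'C'.

    retrieved         -- key frame ids returned as the top choices
    correct_key_frame -- id of the most recent key frame before the query
    clip_of           -- hypothetical mapping: key frame id -> clip id
    """
    if correct_key_frame in retrieved:
        return "A"                      # correct key frame among the top choices
    clip = clip_of[correct_key_frame]
    if any(clip_of[r] == clip for r in retrieved):
        return "B"                      # another key frame from the same clip
    return "C"                          # miss: nothing from the same clip

def percentages(labels):
    """Percentage of queries of each type."""
    counts = Counter(labels)
    return {t: 100.0 * counts[t] / len(labels) for t in "ABC"}

# Illustrative data only.
clip_of = {1: "game", 2: "game", 3: "news"}
labels = [classify_result([1, 3], 1, clip_of),
          classify_result([2], 1, clip_of),
          classify_result([3], 1, clip_of)]
```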

Some type C misses can be justified because we had many clips of the same
“program”, a football game for example. Figure 11(a) shows a line plot of the percentage
of queries that were missed (type C) in the first test as a function of the number of
dimensions and the number of top choices retrieved. The graph shows that the miss
rate drops to zero for just the top choice at 15 dimensions, while for the top two and
three choices it drops to zero at 8 dimensions.

Figure 11(b) shows line plots of the percentages of queries that yielded correct
results (type A) as the top choice (‘top 1’), as the top choice when using flow
(‘top 1 with flow’), and when returning the top two (‘top 2’) and top three (‘top 3’)
frames. The graphs show that better correct retrieval rates and lower miss rates are
achieved by increasing FastMap dimensions or nearest neighbor choices. We observe
a substantial increase in correct retrievals for the ‘top choice with flow’, as compared
with correct retrievals when the flow information is not utilized. The percentages of
type B results can be calculated from the type A and type C percentages shown in
the two plots. From the graphs, it can be seen that we were able to attain over 95%
correct recall with only three frames retrieved.
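The calculation of the type B percentages can be written out explicitly. As a small sketch, using symbols introduced here for illustration (they do not appear elsewhere in the paper), let $A_{d,k}$, $B_{d,k}$, and $C_{d,k}$ denote the percentages of type A, B, and C results with $d$ FastMap dimensions and $k$ nearest neighbors retrieved. Assuming every query falls into exactly one of the three types:

```latex
A_{d,k} + B_{d,k} + C_{d,k} = 100
\quad\Longrightarrow\quad
B_{d,k} = 100 - A_{d,k} - C_{d,k}
```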

Figure 11: Retrieval results on video: (a) Recall miss percentages for the ‘frame
offsets 1 to 6’ test, (b) Correct retrieval percentages for the top 1, 2, 3 choices, and
also for the top choice with flow confirmation.

For the second set of tests, one frame was selected from every 30 frames in a
clip as a single query frame, yielding a total of 473 test frames. This experiment was
conducted to study how the technique performs on simulated random queries. In this
experiment, a query could be quite distant from the key frame of the scene that
it belongs to. Key frame retrieval was performed using each of these query frames
and its ‘flow’. By using the frame numbers of shot boundaries, correct results were
identified. Tests were carried out for 4, 8, 10, and 15 dimensions, while varying the
number of nearest neighbors from 1 to 3.
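The use of a query frame's ‘flow’ alongside the spatial match can be sketched as a reordering of the candidates returned by the spatial index. The exact flow representation and comparison are not restated in this section, so the sketch assumes, purely for illustration, that each key frame has a fixed-length flow vector and that flow similarity is measured by Euclidean distance; `rerank_by_flow` and `flow_of` are hypothetical names.

```python
import math

def rerank_by_flow(candidates, query_flow, flow_of):
    """Reorder candidate key frames by flow similarity to the query.

    candidates -- key frame ids returned by the spatial (DC-based) search
    query_flow -- flow vector of the query frame (assumed representation)
    flow_of    -- hypothetical mapping: key frame id -> flow vector
    """
    return sorted(candidates, key=lambda kf: math.dist(query_flow, flow_of[kf]))

# Illustrative data only: three candidates with 2-D flow vectors.
flow_of = {"kf1": [0.0, 0.0], "kf2": [1.0, 0.5], "kf3": [4.0, 4.0]}
reordered = rerank_by_flow(["kf3", "kf1", "kf2"], [0.9, 0.6], flow_of)
```

Keeping the spatial search as the primary index and applying the temporal information only to the few returned candidates keeps the extra cost per query small.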

Figures 12(a) and (b) show the graphs for the ‘every 30 frames’ test. The miss rate
for low dimensions is quite high for just the top choice, but drops to an acceptable
level for higher dimensions with the top three choices. Due to the randomness of the
queries, one would not expect results similar to those of the ‘frames 1 to 6’ test. As
in the previous test, we observe an increase in correct retrieval with rearrangement of
the top choices according to flow similarity. The results of such a test depend largely
on the set of key frames used. The more representative the set of key frames is of the
entire content of the video, the better the absolute performance. Only a relative
comparison with some ideal retrieval system can be used to evaluate the performance
of our technique.

Figure 12: Retrieval results on video: (a) Recall miss percentages for the ‘every 30
frames’ test, (b) Correct retrieval percentages for the top 1, 2, 3 choices, and also for
the top choice with flow confirmation.

To evaluate the accuracy of FastMap, we compared FastMap retrieval with an ideal
retrieval system employing the Euclidean distance metric with the original features.
We performed the ‘every 30 frames’ test with all 1800 DC coefficients and used a
pure Euclidean distance metric to find the nearest neighbors of each of the query test
frames. The percentage of queries that resulted in misses (result type C) was 13.5%.
Figure 13 shows the plot of miss percentages for each of dimensions 4, 8, 10, and 15
and for the top one to top three nearest neighbors. The percentage of misses obtained
by using pure Euclidean distance without any dimensionality reduction is shown by
the horizontal line. With four dimensions, taking the top two choices performed
slightly better than just using Euclidean distance, whereas for dimensions 8 and 10,
the top two choices were much better. For 15 dimensions, even the topmost choice
was sufficient to yield better performance than the Euclidean distance metric.

Figures  14(a)  and  (b)  illustrate  some  sample  query  results.  The  leftmost  frame  is 
the  query,  followed  by  the  three  top  matches  to  its  right.  With  the  above-mentioned 
number  of  key  frames  in  the  database,  these  experiments  show  that  queries  can  be 
processed  in  a  fraction  of  a  second  on  a  SunSPARC  20  workstation. 

Figure 13: Recall miss percentages for the ‘every 30 frames’ test, showing comparison
with Euclidean distance results.

7  Conclusion 

We have presented techniques for indexing and retrieval of MPEG-compressed video
directly from the compressed domain without performing expensive decoding
computations. Video is parsed into shots, subshots, and scenes, and key frames are selected.
Features are then extracted from these key frames. We have discussed ways of
generating a framework in which the I, P, and B frames can be considered equivalent,
thereby avoiding any restrictions imposed by the MPEG encoding process. Using the
FastMap algorithm, the DC coefficients of the key frames are transformed into
manageable vectors for archiving, and indexing is performed using these low-dimensional
vectors. The motion information is further used to test the potential candidates to
yield more robust results.

Figure 14: Retrieval results on video: (a) Example 1 from the ‘every 30 frames’ test,
(b) Example 2 from the ‘every 30 frames’ test.